Robust Bayesian Mixture Modeling

ABSTRACT

A Bayesian treatment of mixture models is based on individual components having Student distributions, which have heavier tails compared to the exponentially decaying tails of Gaussians. The mixture of Student distribution components is characterized by a set of modeling parameters. Tractable approximations of the posterior distributions of individual modeling parameters are optimized and used to generate a data model for a set of input data.

TECHNICAL FIELD

The invention relates generally to statistical analysis and machine learning algorithms, and more particularly to robust Bayesian mixture modeling.

BACKGROUND

Mixture models are common tools of statistical analysis and machine learning. For example, when trying to model a statistical data distribution, a single Gaussian model may not adequately approximate the data, particularly when the data has multiple modes or clusters (e.g., has more than one peak).

As such, a common approach is to use a mixture of two or more Gaussian components, fitted with maximum likelihood, to model such data. Nevertheless, even a mixture of Gaussians (MOG) presents modeling problems, such as inadequate modeling of outliers and severe overfitting. For example, there are singularities in the likelihood function arising from the collapse of components onto individual data points, a pathological result.

Some problems with a pure MOG can be elegantly addressed by adopting a Bayesian framework to marginalize over the model parameters with respect to appropriate priors. The resulting Bayesian model likelihood can then be maximized with respect to the number of Gaussian components in the mixture, if the goal is model selection, or combined with a prior over the number of the components, if the goal is model averaging. One benefit of a Bayesian approach using a mixture of Gaussians is the elimination of maximum likelihood singularities, although it still lacks robustness to outliers. In addition, in the Bayesian model selection context, the presence of outliers or other departures of the empirical distribution from Gaussianity can lead to errors in the determination of the number of clusters in the data.

SUMMARY

Implementations described and claimed herein address the foregoing problems using a Bayesian treatment of mixture models based on individual components having Student distributions, which have heavier tails compared to the exponentially decaying tails of Gaussians. The mixture of Student distribution components is characterized by a set of modeling parameters. Tractable approximations of the posterior distributions of individual modeling parameters are optimized and used to generate a data model for a set of input data.

In some implementations, articles of manufacture are provided as computer program products. One implementation of a computer program product provides a computer program storage medium readable by a computer system and encoding a computer program. Another implementation of a computer program product may be provided in a computer data signal embodied in a carrier wave by a computing system and encoding the computer program.

The computer program product encodes a computer program for executing a computer process on a computer system. A modeling parameter is selected from a plurality of modeling parameters characterizing a mixture of Student distribution components. A tractable approximation of a posterior distribution for the selected modeling parameter is computed based on an input set of data and a current estimate of a posterior distribution of at least one unselected modeling parameter in the plurality of modeling parameters. A lower bound of a log marginal likelihood is computed as a function of current estimates of the posterior distributions of the modeling parameters. The current estimates of the posterior distributions of the modeling parameters include the computed tractable approximation of the posterior distribution of the selected modeling parameter. A probability density that models the input set of data is generated if the lower bound is satisfactorily optimized. The probability density includes the mixture of Student distribution components, which is characterized by the current estimates of the posterior distributions of the modeling parameters.

In another implementation, a method is provided. A modeling parameter is selected from a plurality of modeling parameters characterizing a mixture of Student distribution components. A tractable approximation of a posterior distribution for the selected modeling parameter is computed based on an input set of data and a current estimate of a posterior distribution of at least one unselected modeling parameter in the plurality of modeling parameters. A lower bound of a log marginal likelihood is computed as a function of current estimates of the posterior distributions of the modeling parameters. The current estimates of the posterior distributions of the modeling parameters include the computed tractable approximation of the posterior distribution of the selected modeling parameter. A probability density that models the input set of data is generated if the lower bound is satisfactorily optimized. The probability density includes the mixture of Student distribution components, which is characterized by the current estimates of the posterior distributions of the modeling parameters.

In another implementation, a system is provided. A modeling parameter selector selects a modeling parameter from a plurality of modeling parameters characterizing a mixture of Student distribution components. A tractable approximation module computes a tractable approximation of a posterior distribution for the selected modeling parameter based on an input set of data and a current estimate of a posterior distribution of at least one unselected modeling parameter in the plurality of modeling parameters. A lower bound optimizer module computes a lower bound of a log marginal likelihood as a function of current estimates of the posterior distributions of the modeling parameters. The current estimates of the posterior distributions of the modeling parameters include the computed tractable approximation of the posterior distribution of the selected modeling parameter. A data model generator generates a probability density modeling the input set of data if the lower bound is satisfactorily optimized. The probability density includes the mixture of Student distribution components, which is characterized by the current estimates of the posterior distributions of the modeling parameters.

Other implementations are also described and recited herein.

BRIEF DESCRIPTIONS OF THE DRAWINGS

FIG. 1 illustrates exemplary probability distributions for modeling a data set.

FIG. 2 illustrates exemplary operations for robust Bayesian mixture modeling.

FIG. 3 illustrates an exemplary robust Bayesian mixture modeling system.

FIG. 4 illustrates a system useful for implementing an embodiment of the present invention.

DETAILED DESCRIPTION

FIG. 1 illustrates exemplary probability distributions 100 for modeling a data set. A single Gaussian distribution 102 models an input data set of independent, identically distributed (i.i.d.) data 104. Note that the mean 106 of the single Gaussian distribution 102 is pulled substantially to the right in order to accommodate the outlier data element 106, thereby compromising the accuracy of the Gaussian model as it applies to the given data set 104. In addition, the standard deviation of the distribution 102 is undesirably increased by the outlier 106.

In order to improve the modeling of the data 104, a mixture of Gaussian distributions 108 may be used. However, fitting the mixture 108 to the data set 104 using a maximum likelihood approach does not yield a usable optimal number of components, because the maximum likelihood approach favors an ever more complex model, leading to the undesirable extreme of an individual, infinite-magnitude Gaussian distribution component for each individual data point. While overfitting of Gaussian mixture models can be addressed to some extent using Bayesian inference, even then, Gaussian mixture models continue to lack robustness as to outliers.

A mixture of Student distributions 110 can demonstrate a significant improvement in robustness as compared to a mixture of Gaussian distributions. However, there is no closed form solution for maximizing the likelihood under a Student distribution. Furthermore, the maximum likelihood approach does not address the problem of overfitting. Therefore, a mixture of Student distributions 110 combined with a tractable Bayesian treatment to fit the Student mixture to the input data 104 addresses these issues, as illustrated in FIG. 1. However, no satisfactory method or system for obtaining a tractable Bayesian treatment of Student mixture distributions has previously been demonstrated. As such, in one implementation, robust Bayesian mixture modeling obtains a tractable Bayesian treatment of Student mixture distributions based on variational inference. In another implementation, a tractable approximation may be obtained using Monte Carlo-based techniques.

Robust Bayesian mixture modeling is based on a mixture of component distributions given by a multivariate Student distribution, also known as a t-distribution. A Student distribution represents a generalization of a Gaussian distribution and, in the limit ν→∞, the Student distribution reduces to a Gaussian distribution with mean μ and precision Λ (i.e., inverse covariance). For finite values of ν, the Student distribution has heavier tails than the corresponding Gaussian having the same μ and Λ.

A Student distribution over a d-dimensional random variable x may be represented in the following form:

$$S(x \mid \mu, \Lambda, \nu) = \frac{\Gamma\left(\frac{\nu + d}{2}\right) |\Lambda|^{1/2}}{\Gamma\left(\frac{\nu}{2}\right) (\nu\pi)^{d/2}} \left(1 + \frac{\Delta^2}{\nu}\right)^{-\frac{\nu + d}{2}} \tag{1}$$

where $\Delta^2 = (x - \mu)^T \Lambda (x - \mu)$ represents the squared Mahalanobis distance from x to μ.
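
The density of Equation (1) can be transcribed directly. The sketch below, which assumes NumPy and SciPy are available (the helper name `student_pdf` is illustrative, not from the text), checks the hand-written density against SciPy's multivariate t, which is parameterized by the shape matrix Λ⁻¹ rather than the precision Λ:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_t

def student_pdf(x, mu, Lam, nu):
    """Density of Equation (1): Student distribution with precision Lam."""
    d = len(mu)
    diff = x - mu
    delta2 = diff @ Lam @ diff                    # squared Mahalanobis distance
    log_pdf = (gammaln((nu + d) / 2) - gammaln(nu / 2)
               + 0.5 * np.linalg.slogdet(Lam)[1]  # ln |Lam|^(1/2)
               - 0.5 * d * np.log(nu * np.pi)
               - 0.5 * (nu + d) * np.log1p(delta2 / nu))
    return np.exp(log_pdf)

mu = np.array([0.0, 1.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])          # precision (inverse covariance)
x = np.array([0.5, 0.2])
ref = multivariate_t(loc=mu, shape=np.linalg.inv(Lam), df=5.0).pdf(x)
assert np.isclose(student_pdf(x, mu, Lam, 5.0), ref)
```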

In contrast to the Gaussian distribution, no closed form solution for maximizing likelihood exists under a Student distribution. However, the Student distribution may be represented as an infinite mixture of scaled Gaussian distributions over x with an additional random variable u, which acts as a scaling parameter of the precision matrix Λ, such that the Student distribution may be represented in the following form:

$$S(x \mid \mu, \Lambda, \nu) = \int_{0}^{\infty} N(x \mid \mu, u\Lambda)\, G\left(u \mid \frac{\nu}{2}, \frac{\nu}{2}\right) \mathrm{d}u \tag{2}$$

where N(x|μ, uΛ) denotes the Gaussian distribution with mean μ and precision matrix uΛ, and G(u|a, b) represents the Gamma distribution. For each observation of x (i.e., of N observations), a corresponding implicit posterior distribution over the variable u exists.
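
The identity in Equation (2) is easy to verify numerically. A minimal sketch, assuming NumPy/SciPy and using the one-dimensional case for brevity (all parameter values are arbitrary examples):

```python
import numpy as np
from scipy.stats import gamma, multivariate_t

rng = np.random.default_rng(0)
mu, lam, nu = 0.0, 1.5, 4.0          # scalar mean, precision, and dof
x = 2.0

# u ~ G(u | nu/2, nu/2): shape nu/2 and rate nu/2 (SciPy's scale is 1/rate).
u = gamma(a=nu / 2, scale=2 / nu).rvs(size=200_000, random_state=rng)

# N(x | mu, u*lam) has precision u*lam; average its density over the samples.
prec = u * lam
mc = np.mean(np.sqrt(prec / (2 * np.pi)) * np.exp(-0.5 * prec * (x - mu) ** 2))

exact = multivariate_t(loc=[mu], shape=[[1 / lam]], df=nu).pdf([x])
print(mc, exact)                     # the two estimates should closely agree
```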

The probability density of a mixture of M Student distributions may be represented in the form:

$$p(x \mid \{\mu_m, \Lambda_m, \nu_m\}, \pi) = \sum_{m=1}^{M} \pi_m\, S(x \mid \mu_m, \Lambda_m, \nu_m) \tag{3}$$

where the mixing coefficients $\pi = (\pi_1, \ldots, \pi_M)^T$ satisfy $0 \leq \pi_m \leq 1$ and $\sum_{m=1}^{M} \pi_m = 1$.
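
Equation (3) is a plain weighted sum of component densities. A direct transcription for a two-component mixture, assuming SciPy (the parameter values are arbitrary examples):

```python
import numpy as np
from scipy.stats import multivariate_t

pi = [0.7, 0.3]                                   # mixing coefficients, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0])]
Lams = [np.eye(2), 2 * np.eye(2)]                 # precision matrices
nus = [5.0, 10.0]

def mixture_pdf(x):
    """p(x | {mu_m, Lam_m, nu_m}, pi) of Equation (3)."""
    return sum(p * multivariate_t(loc=m, shape=np.linalg.inv(L), df=v).pdf(x)
               for p, m, L, v in zip(pi, mus, Lams, nus))

print(mixture_pdf(np.array([1.0, 1.0])))
```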

In order to find a tractable treatment of this model, the mixture density of Equation (3) may be expressed in terms of a marginalization over a binary latent labeling variable s of dimensions N×M (i.e., N representing the number of data elements and M representing the number of Student distribution components in the mixture) and the unobserved variable u_nm, also of dimensions N×M when applied to a mixture. Variable s has components {s_nj} such that s_nm = 1 and s_nj = 0 for j ≠ m, resulting in:

$$p(x_n \mid s, \{\mu_m, \Lambda_m, \nu_m\}) = \prod_{n,m}^{N,M} S(x_n \mid \mu_m, \Lambda_m, \nu_m)^{s_{nm}} \tag{4}$$

with a corresponding prior distribution over s of the form:

$$p(s \mid \pi) = \prod_{n,m}^{N,M} \pi_m^{s_{nm}} \tag{5}$$

It can be verified that marginalization of the product of Equations (4) and (5) over the latent variable s recovers the Student distribution mixture of Equation (3).

An input data set X includes N i.i.d. observations x_n, where n = 1, …, N, which are assumed to be drawn independently from the distribution characterized by Equation (3). Thus, for each data observation x_n, a corresponding discrete latent variable s_n specifies which component of the mixture generated that data point, and a continuous latent variable u_nm specifies the scaling of the precision for the corresponding equivalent Gaussian distribution from which the data was hypothetically generated.

In addition to the prior distribution over s, prior distributions for the modeling parameters μ_m, Λ_m, and π are used in a Bayesian treatment of probability density estimation. As such, distributions of the modeling parameters are used rather than the parameters themselves. In one implementation, for tractability, conjugate priors from the exponential family have been chosen in the form:

$$p(\mu_m) = N(\mu_m \mid m_0, \rho_0 I) \tag{6}$$

$$p(\Lambda_m) = W(\Lambda_m \mid W_0, \eta_0) \tag{7}$$

$$p(\pi) = D(\pi \mid \alpha) \tag{8}$$

wherein W(Λ|W₀, η₀) represents the Wishart distribution and D(π|α) represents the Dirichlet distribution. The prior p(u) is implicitly defined in Equation (2) to equal the Gamma distribution G(u|ν/2, ν/2).
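
For concreteness, one draw from the priors of Equations (6)-(8) might look as follows. This is a sketch assuming SciPy, with two deviations from the settings quoted below: SciPy's Wishart sampler requires the degrees of freedom to be at least d, and a Dirichlet with α_m = 10⁻³ tends to underflow in naive floating-point sampling, so slightly larger illustrative values are used here:

```python
import numpy as np
from scipy.stats import dirichlet, multivariate_normal, wishart

d, M = 2, 3
m0, rho0 = np.zeros(d), 1e-3          # broad prior over the means
W0 = np.eye(d)                        # Wishart scale matrix
eta0 = float(d)                       # degrees of freedom (SciPy needs >= d)
alpha = np.full(M, 0.5)               # Dirichlet concentration (illustrative)

mu = multivariate_normal(mean=m0, cov=np.eye(d) / rho0).rvs(random_state=0)  # Eq. (6)
Lam = wishart(df=eta0, scale=W0).rvs(random_state=0)                         # Eq. (7)
pi = dirichlet(alpha).rvs(random_state=0)[0]                                 # Eq. (8)
print(mu, Lam, pi, sep="\n")
```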

It should be understood that prior distributions may be selected from other members of the exponential family in alternative embodiments. The parameters of the prior distributions on μ and Λ are chosen to give broad distributions (e.g., in one implementation, m₀ = 0, ρ₀ = 10⁻³, W₀ = I, and η₀ = 1). For the prior distribution over π, the parameters α = {α_m} are interpreted as effective numbers of prior observations, with α_m = 10⁻³.

Exact inference of the Bayesian model is intractable. However, with the choice of exponential-family distributions to represent the prior distributions of the modeling parameters, tractable approximations are possible. In one implementation, for example, a tractable approximation may be obtained through Monte Carlo techniques.

In another implementation, variational inference may be employed to obtain tractable approximations of the posterior distributions over the identified stochastic modeling parameters, which in one implementation include {μ_m, Λ_m}, π, and {s_n, u_nm}. (Another modeling parameter, ν, is treated in a deterministic (i.e., non-stochastic) fashion; however, only one such parameter exists per mixture component.)

In variational inference, the log marginal likelihood is maximized. One form of the log marginal likelihood is shown:

$$\ln \prod_{n}^{N} p(x_n \mid m_0, \rho_0, W_0, \eta_0) = \ln \int \prod_{n}^{N} p(x_n, u_n \mid \mu, \Lambda, \nu)\, p(\mu \mid m_0, \rho_0)\, p(\Lambda \mid W_0, \eta_0)\, \mathrm{d}u_n\, \mathrm{d}\mu\, \mathrm{d}\Lambda \tag{9}$$

This quantity cannot be maximized directly. However, Equation (9) can be re-written as follows:

$$\ln \int p(X \mid \theta)\, p(\theta \mid m_0, \rho_0, W_0, \eta_0, \nu)\, \mathrm{d}\theta = \int q(\theta) \ln \frac{p(X, \theta \mid m_0, \rho_0, W_0, \eta_0, \nu)}{q(\theta)}\, \mathrm{d}\theta - \int q(\theta) \ln \frac{p(\theta \mid X, m_0, \rho_0, W_0, \eta_0, \nu)}{q(\theta)}\, \mathrm{d}\theta \tag{10}$$

where X = {x_n}, θ = {μ, Λ, u}, u = {u_n}, and q(θ) is the so-called variational distribution over μ, Λ, and u, such that q(θ) = q(μ)q(Λ)q(u) (assuming q(μ), q(Λ), and q(u) are independent).

The second term of Equation (10) is the Kullback-Leibler (KL) divergence between q(θ) and p(θ|{x_n}, m₀, ρ₀, W₀, η₀, ν), which is non-negative and equals zero only if the two distributions are identical. Thus, the first term can be understood as a lower bound L(q) on the log marginal likelihood. Therefore, seeking to minimize the second term of Equation (10) amounts to maximizing the lower bound L(q).

Accordingly, one way to represent the lower bound L(q) is shown:

$$L(q) \equiv \int q(\theta) \ln \left\{ \frac{p(X, \theta)}{q(\theta)} \right\} \mathrm{d}\theta \leq \ln p(X) \tag{11}$$

where θ represents the set of all unobserved stochastic variables.

In Equation (11), q(θ) represents the variational posterior distribution, and p(X, θ) is the joint distribution over the data and the stochastic modeling parameters. The difference between the right-hand side of Equation (11) and L(q) is given by the KL divergence KL(q‖p) between the variational posterior distribution q(θ) and the true posterior distribution p(θ|X).

Given the priors of Equations (5), (6), (7), and (8), the variational posterior distributions q(·) for s, π, μ_m, Λ_m, and u may be computed.

For q(s), where s represents the labeling parameters:

$$q(s) = \prod_{n,m}^{N,M} p_{nm}^{s_{nm}} \tag{12}$$

where

$$p_{nm} = \frac{r_{nm}}{\sum_{m'=1}^{M} r_{nm'}} \tag{13}$$

where, in turn,

$$r_{nm} = \exp\left( \langle \ln \pi_m \rangle + \frac{1}{2} \langle \ln |\Lambda_m| \rangle + \frac{d}{2} \langle \ln u_{nm} \rangle - \frac{\langle u_{nm} \rangle \langle \Delta_{nm}^2 \rangle}{2} - \frac{d}{2} \ln 2\pi \right) \tag{14}$$

(The last term in the argument of the exponential cancels out in Equation (13).) In addition,

$$\langle \ln |\Lambda_m| \rangle = d \ln 2 - \ln |W_m^{-1}| + \sum_{i=1}^{d} \Psi\left( \frac{\eta_m + 1 - i}{2} \right) \tag{15}$$

$$\langle \Delta_{nm}^2 \rangle = x_n^T \eta_m W_m x_n - 2 x_n^T \eta_m W_m m_m + \mathrm{Tr}\left[ \left( m_m m_m^T + R_m^{-1} \right) \eta_m W_m \right] \tag{16}$$

and

$$\langle s_{nm} \rangle = p_{nm} \tag{17}$$
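
Because the r_nm of Equation (14) are exponentials, computing the normalized p_nm of Equation (13) in the log domain avoids overflow and underflow. A minimal sketch assuming NumPy/SciPy; the function name is illustrative, and the input arrays stand in for the expectations ⟨ln π_m⟩, ⟨ln|Λ_m|⟩, ⟨ln u_nm⟩, ⟨u_nm⟩, and ⟨Δ²_nm⟩ computed from the other factors:

```python
import numpy as np
from scipy.special import logsumexp

def responsibilities(ln_pi, ln_det_Lam, ln_u, u_mean, delta2, d):
    """p_nm of Eq. (13) from the log of r_nm in Eq. (14); output shape (N, M)."""
    ln_r = (ln_pi[None, :] + 0.5 * ln_det_Lam[None, :]
            + 0.5 * d * ln_u - 0.5 * u_mean * delta2
            - 0.5 * d * np.log(2 * np.pi))
    # Subtracting the per-row log-normalizer makes each row sum to 1.
    return np.exp(ln_r - logsumexp(ln_r, axis=1, keepdims=True))
```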

For q(π), where π represents the mixing coefficients:

$$q(\pi) = D(\pi \mid \alpha) \tag{18}$$

where

$$\alpha_m = \sum_{n=1}^{N} \langle s_{nm} \rangle + \hat{\alpha}_m \tag{19}$$

and

$$\langle \pi_m \rangle = \frac{\alpha_m}{\alpha_0} \tag{20}$$

where

$$\Psi(a) = \frac{\mathrm{d} \ln \Gamma(a)}{\mathrm{d}a} \tag{21}$$

and m′ = 1, …, M. Furthermore, ⟨ln π_m⟩ = Ψ(α_m) − Ψ(α₀), where $\alpha_0 = \sum_{m'} \alpha_{m'}$.
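
Equations (19)-(21) translate into a few lines. A sketch assuming NumPy/SciPy (SciPy's `digamma` is the Ψ of Equation (21)); `s_mean` stands for the (N, M) array of ⟨s_nm⟩ from Equation (17), and `alpha_prior` for the prior counts:

```python
import numpy as np
from scipy.special import digamma

def update_pi(s_mean, alpha_prior):
    """q(pi) statistics per Eqs. (19)-(21)."""
    alpha = s_mean.sum(axis=0) + alpha_prior       # Eq. (19)
    alpha0 = alpha.sum()
    pi_mean = alpha / alpha0                       # Eq. (20), <pi_m>
    ln_pi_mean = digamma(alpha) - digamma(alpha0)  # <ln pi_m>
    return alpha, pi_mean, ln_pi_mean
```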

For q(μ_m), where μ_m represents the mean of the m-th Student distribution component in the mixture:

$$q(\mu_m) = N(\mu_m \mid m_m, R_m) \tag{22}$$

where

$$R_m = \langle \Lambda_m \rangle \sum_{n=1}^{N} \langle w_{nm} \rangle + \rho_0 I, \qquad m_m = R_m^{-1} \left( \langle \Lambda_m \rangle \sum_{n=1}^{N} \langle w_{nm} \rangle x_n + \rho_0 m_0 \right) \tag{23}$$

and

$$\langle w_{nm} \rangle = \langle s_{nm} \rangle \langle u_{nm} \rangle \tag{24}$$
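
A sketch of the q(μ_m) update of Equations (23)-(24) for a single component m, assuming NumPy; `Lam_mean` stands for ⟨Λ_m⟩, `w` for the length-N vector of ⟨w_nm⟩, and the function name is illustrative:

```python
import numpy as np

def update_mu(X, Lam_mean, w, rho0, m0):
    """Precision R_m and mean m_m of q(mu_m) per Eqs. (22)-(23)."""
    d = X.shape[1]
    R = Lam_mean * w.sum() + rho0 * np.eye(d)               # Eq. (23), precision
    m = np.linalg.solve(R, Lam_mean @ (w @ X) + rho0 * m0)  # Eq. (23), mean
    return R, m
```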

For q(Λ_m), where Λ_m represents the precision matrix of the m-th Student distribution component in the mixture:

$$q(\Lambda_m) = W(\Lambda_m \mid W_m, \eta_m) \tag{25}$$

where

$$W_m^{-1} = W_0^{-1} + \sum_{n}^{N} \langle w_{nm} \rangle \left( x_n x_n^T - x_n m_m^T - m_m x_n^T + \left( m_m m_m^T + R_m^{-1} \right) \right) \tag{26}$$

and

$$\eta_m = \eta_0 + \hat{s}_m, \quad \text{where} \quad \hat{s}_m = \sum_{n}^{N} \langle s_{nm} \rangle \tag{27}$$
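
The corresponding q(Λ_m) update of Equations (26)-(27) for one component, again as a NumPy sketch with illustrative names; `R_inv` stands for R_m⁻¹ from Equation (23), and `w`, `s` for the length-N vectors of ⟨w_nm⟩ and ⟨s_nm⟩:

```python
import numpy as np

def update_Lam(X, w, s, m_m, R_inv, W0_inv, eta0):
    """Wishart parameters W_m and eta_m of q(Lam_m) per Eqs. (26)-(27)."""
    scatter = sum(wn * (np.outer(xn, xn) - np.outer(xn, m_m)
                        - np.outer(m_m, xn) + np.outer(m_m, m_m) + R_inv)
                  for wn, xn in zip(w, X))
    W_inv = W0_inv + scatter                     # Eq. (26)
    eta = eta0 + s.sum()                         # Eq. (27)
    return np.linalg.inv(W_inv), eta
```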

For q(u), where u represents the scaling parameters of the precision matrices:

$$q(u_{nm}) = G(u_{nm} \mid a_{nm}, b_{nm}) \tag{28}$$

where

$$a_{nm} = \frac{\nu_m + \langle s_{nm} \rangle d}{2} \tag{29}$$

where d represents the dimensionality of the data,

$$b_{nm} = \frac{\nu_m + \langle s_{nm} \rangle \langle \Delta_{nm}^2 \rangle}{2} \tag{30}$$

and

$$\langle \Delta_{nm}^2 \rangle = x_n^T \eta_m W_m x_n - 2 x_n^T \eta_m W_m m_m + \mathrm{Tr}\left[ \left( m_m m_m^T + R_m^{-1} \right) \eta_m W_m \right] \tag{31}$$

as in Equation (16).
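
The q(u_nm) update of Equations (29)-(30), together with the Gamma moments consumed by the other factors, as a NumPy/SciPy sketch with illustrative names; `s_mean` and `delta2` are the (N, M) arrays of ⟨s_nm⟩ and ⟨Δ²_nm⟩, and `nu` is the length-M vector of ν_m:

```python
import numpy as np
from scipy.special import digamma

def update_u(s_mean, delta2, nu, d):
    """a_nm, b_nm of q(u_nm), plus <u_nm> and <ln u_nm> for a Gamma(a, b)."""
    a = 0.5 * (nu[None, :] + s_mean * d)         # Eq. (29)
    b = 0.5 * (nu[None, :] + s_mean * delta2)    # Eq. (30)
    return a, b, a / b, digamma(a) - np.log(b)   # <u> = a/b, <ln u> = psi(a) - ln b
```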

A constrained family of distributions for q(θ) is chosen such that the lower bound L(q) becomes tractable. The optimal member of the family can then be determined by maximization of L(q), which is equivalent to minimization of the KL divergence. Thus, the resulting optimal solution for q(θ) represents an approximation of the true posterior p(θ|{x_n}, m₀, ρ₀, W₀, η₀, ν), assuming a factorized variational distribution for q(θ) of:

$$q(\theta) = q(\{\mu_m\})\, q(\{\Lambda_m\})\, q(\pi)\, q(\{s_n\})\, q(\{u_n\}) \tag{32}$$

A free-form variational optimization is now possible with respect to each of the individual variational factors of Equation (32). Because the variational factors are coupled, the variational approximations of the factors are computed iteratively by first initializing the distributions and then cycling through the factors in turn, replacing the current estimate of each factor by its optimal solution, given the current estimates for the other factors, to give a new approximation of q(θ). Interleaved with the optimization with respect to each of the individual variational factors, the lower bound is optimized with respect to each of the non-stochastic parameters ν_m by employing standard non-linear optimization techniques. The lower bound L(q) is then computed using the new approximation of q(θ) for the current iteration; a minimal sketch of this cycle appears below.
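
The following skeleton shows the shape of that coordinate-ascent cycle. It is a sketch only: every function name here (`update_q_s`, `optimize_nu`, `compute_bound`, and so on) is a hypothetical placeholder for the updates of Equations (12)-(30), the non-linear ν_m step, and the bound of Equation (11):

```python
def fit(X, params, max_iter=200, tol=1e-6):
    """Cycle through the variational factors until L(q) stops improving."""
    bound_old = -float("inf")
    for _ in range(max_iter):
        params = update_q_s(X, params)       # Eqs. (12)-(17)
        params = update_q_pi(params)         # Eqs. (18)-(21)
        params = update_q_mu(X, params)      # Eqs. (22)-(24)
        params = update_q_Lam(X, params)     # Eqs. (25)-(27)
        params = update_q_u(X, params)       # Eqs. (28)-(30)
        params = optimize_nu(params)         # deterministic nu_m, non-linear step
        bound = compute_bound(X, params)     # lower bound L(q), Eq. (11)
        if bound - bound_old < tol:          # change below threshold: converged
            break
        bound_old = bound
    return params, bound
```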

In one implementation, the iteration continues until the lower bound L(q) changes by less than a given threshold. In an alternative implementation, q(θ) may also be tested prior to computation of the lower bound L(q) in each iteration, such that if the value of q(θ) changes by less than another given threshold, the iteration skips the computation and testing of the lower bound L(q) and exits the loop. In yet another implementation, individual factors of Equation (32) may be tested to determine whether to terminate the optimization of the modeling parameters.

In the described approach, approximate posterior distributions of the stochastic modeling parameters {μ_m, Λ_m}, π, and {s_n, u_nm}, as well as a value of the modeling parameter ν, are determined. Given these modeling parameters, the Student mixture density of Equation (3) can be obtained to model the input data.

FIG. 2 illustrates exemplary operations 200 for robust Bayesian mixture modeling. A receiving operation 202 receives prior distributions of each modeling parameter in the set of modeling parameters for a mixture of Student distributions. In one implementation, the prior distributions may be computed using Equations (5), (6), (7), and (8), although other prior distributions may be used in alternative embodiments. As such, an operation of computing the prior distributions (not shown) may also be included in an alternative implementation.

Another receiving operation 204 receives the independent, identically distributed data. Exemplary data may include, without limitation: auditory speech data from an unknown number of speakers, where determining the correct number of speakers is part of the modeling process; and image segmentation data from images containing a few large and relatively homogeneous regions as well as several very small regions of different characteristics (outlier regions), where modeling of the few larger regions should not be notably affected by the presence of the outlier regions.

Yet another receiving operation 206 receives initial estimates of the posterior distributions for a set of modeling parameters for a mixture of Student distributions. The initial estimates may be received from another process or be determined in a determining operation (not shown) using a variety of methods, including a random approach. However, the optimization of the modeling parameters can converge more quickly if the initial estimates are closer to the actual posterior distributions. In one implementation, heuristics are applied to the prior distributions to determine these initial estimates. In a simple example, the posteriors are set equal to the priors. A more elaborate example is to heuristically combine the priors with the results of fast, non-probabilistic methods, such as K-means clustering; one such initialization is sketched below.
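
A sketch of the K-means-based initialization, assuming scikit-learn is available; the helper name, and the choices to seed responsibilities with hard assignments and precision scalings with ones, are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, M, seed=0):
    """Heuristic initial estimates for <s_nm>, the means, and <u_nm>."""
    km = KMeans(n_clusters=M, n_init=10, random_state=seed).fit(X)
    s_mean = np.eye(M)[km.labels_]       # hard one-hot <s_nm> from cluster labels
    mus = km.cluster_centers_            # initial component means m_m
    u_mean = np.ones((X.shape[0], M))    # neutral precision scalings
    return s_mean, mus, u_mean
```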

A selection operation 208 selects one of the modeling parameters in the set of modeling parameters. A computation operation 210 computes a tractable approximation of the posterior distribution of the selected modeling parameter using the current estimates of the other modeling parameters. (In the first iteration, the current estimates of the other modeling parameters represent their initial estimates.) In one implementation, the current state of the estimate of each modeling parameter is stored in a storage location, such as in a memory.

In the illustrated implementation, a variational inference method produces the tractable approximation. In one variational inference approach, the tractable posterior distribution is approximated using Equations (12), (18), (22), (25), and (28). The tractable approximation of the selected modeling parameter becomes the current estimate of that modeling parameter, which can be used in subsequent iterations. Alternatively, other approximation methods, including Monte Carlo techniques, may be employed.

A computation operation 212 computes the lower bound of the log marginal likelihood, such as by using Equation (11). If the lower bound is insufficiently optimized according to the computation operation 212, such as by improving by greater than a given threshold or by some other criterion, a decision operation 214 loops processing back to the selection operation 208, which selects another modeling parameter and repeats operations 210, 212, and 214 in a subsequent iteration. However, if the lower bound is sufficiently optimized, processing proceeds to a generation operation 216, which generates the probability density of the data based on the mixture of Student distributions characterized by the current estimates of the modeling parameters (e.g., using Equation (4)).

It should be understood that the order of at least some of the operations in the described process may be altered without altering the results. Furthermore, other methods of determining whether the posterior distribution approximations of the modeling parameters are satisfactorily optimized may be employed, including testing whether the individual posterior distribution factors (e.g., q(s)) change little in each iteration, or testing whether the product (e.g., q(θ)) of the posterior distribution factors changes little in each iteration.

FIG. 3 illustrates an exemplary robust Bayesian mixture modeling system 300. Inputs to the system 300 include input data 302, initial estimates of the modeling parameters 304, and prior distributions of the modeling parameters 306.

A modeling parameter selector 308 selects a modeling parameter that is to be approximated in each iteration. A tractable approximation module 310 receives the inputs and the selection of the modeling parameter to generate a tractable approximation of the selected modeling parameter (e.g., based on variational inference or Monte Carlo techniques). In one implementation, the tractable approximation module 310 also maintains a current state of the estimate of each modeling parameter in a storage location, such as in a memory.

Based on the current estimates of the modeling parameters, including the new approximation of the selected modeling parameter, a lower bound optimizer module 312 computes the lower bound of the log marginal likelihood. If the lower bound fails to satisfy an optimization criterion (such as by increasing more than a threshold amount), the lower bound optimizer module 312 triggers the modeling parameter selector module 308 to select another modeling parameter in a next iteration. Otherwise, the current estimates of the modeling parameters are passed to a data model generator 314, which generates a data model 316 including the probability density of the data based on the mixture of Student distributions characterized by the current estimates of the modeling parameters (e.g., using Equation (4)).

The exemplary hardware and operating environment of FIG. 4 for implementing the invention includes a general purpose computing device in the form of a computer 20, including a processing unit 21, a system memory 22, and a system bus 23 that operatively couples various system components, including the system memory, to the processing unit 21. There may be only one or there may be more than one processing unit 21, such that the processor of computer 20 comprises a single central processing unit (CPU), or a plurality of processing units, commonly referred to as a parallel processing environment. The computer 20 may be a conventional computer, a distributed computer, or any other type of computer; the invention is not so limited.

The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, a switched fabric, point-to-point connections, and a local bus using any of a variety of bus architectures. The system memory may also be referred to as simply the memory, and includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system (BIOS) 26, containing the basic routines that help to transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24. The computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media.

The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical disk drive interface 34, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer 20. It should be appreciated by those skilled in the art that any type of computer-readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROMs), and the like, may be used in the exemplary operating environment.

A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24, or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port, or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, computers typically include other peripheral output devices (not shown), such as speakers and printers.

The computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer 49. These logical connections are achieved by a communication device coupled to or a part of the computer 20; the invention is not limited to a particular type of communications device. The remote computer 49 may be another computer, a server, a router, a network PC, a client, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in FIG. 4. The logical connections depicted in FIG. 4 include a local-area network (LAN) 51 and a wide-area network (WAN) 52. Such networking environments are commonplace in office networks, enterprise-wide computer networks, intranets and the Internet, which are all types of networks. When used in a LAN-networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53, which is one type of communications device. When used in a WAN-networking environment, the computer 20 typically includes a modem 54, a network adapter, a type of communications device, or any other type of communications device for establishing communications over the wide area network 52. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It is appreciated that the network connections shown are exemplary and other means of and communications devices for establishing a communications link between the computers may be used.

In an exemplary implementation, a modeling parameter selector, a tractable approximation module, a lower bound optimizer module, a data model generator, and other modules may be incorporated as part of the operating system 35, application programs 36, or other program modules 37. Initial modeling parameter estimates, input data, modeling parameter priors, and other data may be stored as program data 38.

The embodiments of the invention described herein are implemented as logical steps in one or more computer systems. The logical operations of the present invention are implemented (1) as a sequence of processor-implemented steps executing in one or more computer systems and (2) as interconnected machine modules within one or more computer systems. The implementation is a matter of choice, dependent on the performance requirements of the computer system implementing the invention. Accordingly, the logical operations making up the embodiments of the invention described herein are referred to variously as operations, steps, objects, or modules.

The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.

CLAIMS

1. A method comprising: selecting a modeling parameter from a plurality of modeling parameters characterizing a mixture of Student distribution components; computing a tractable approximation of a posterior distribution for the selected modeling parameter based on an input set of data and a current estimate of a posterior distribution of at least one unselected modeling parameter in the plurality of modeling parameters; computing a lower bound of a log marginal likelihood as a function of current estimates of the posterior distributions of the modeling parameters, the current estimates of the posterior distributions of the modeling parameters including the computed tractable approximation of the posterior distribution of the selected modeling parameter; and generating a probability density modeling the input set of data, the probability density including the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters, if the lower bound is satisfactorily optimized.

2. The method of claim 1 wherein the computing operations comprise a first iteration and further comprising: selecting a different modeling parameter from the plurality of modeling parameters and repeating in a subsequent iteration the operations of computing a tractable approximation and computing a lower bound using the newly selected modeling parameter, if the lower bound is not satisfactorily optimized in the first iteration.

3. The method of claim 1 wherein computing a lower bound comprises: computing the lower bound of the log marginal likelihood as a function of prior distributions of the modeling parameters.

4. The method of claim 1 wherein computing a tractable approximation of a posterior distribution comprises: computing a variational approximation of the posterior distribution of the selected modeling parameter.

5. The method of claim 1 wherein one of the plurality of modeling parameters represents a mean of each of the Student distribution components.

6. The method of claim 1 wherein one of the plurality of modeling parameters represents a precision matrix of the Student distribution components.

7. The method of claim 1 wherein one of the plurality of modeling parameters represents a labeling parameter of the Student distribution components.

8. The method of claim 1 wherein one of the plurality of modeling parameters represents a scaling parameter of a precision matrix of the Student distribution components.

9. The method of claim 1 wherein one of the plurality of modeling parameters represents a mixing coefficients parameter of the Student distribution components.

10. The method of claim 1 wherein generating a probability density comprises: generating the probability density including the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters and an estimate of the number of degrees of freedom of each Student distribution component.

11. The method of claim 1 further comprising: storing the current estimates of the posterior distributions of the modeling parameters in a storage location.

12. The method of claim 1 wherein the input set of data represents auditory speech data from an unknown number of speakers, and further comprising determining a correct number of speakers from the probability density modeling the input set of data.

13. The method of claim 1 wherein the input set of data represents image segmentation data from images having regions of different characteristics.

14. A computer program product encoding a computer program for executing on a computer system a computer process, the computer process comprising: selecting a modeling parameter from a plurality of modeling parameters characterizing a mixture of Student distribution components; computing a tractable approximation of a posterior distribution for the selected modeling parameter based on an input set of data and a current estimate of a posterior distribution of at least one unselected modeling parameter in the plurality of modeling parameters; computing a lower bound of a log marginal likelihood as a function of current estimates of the posterior distributions of the modeling parameters, the current estimates of the posterior distributions of the modeling parameters including the computed tractable approximation of the posterior distribution of the selected modeling parameter; and generating a probability density modeling the input set of data, the probability density including the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters, if the lower bound is satisfactorily optimized.

15. The computer program product of claim 14 wherein the computing operations comprise a first iteration and further comprising: selecting a different modeling parameter from the plurality of modeling parameters and repeating in a subsequent iteration the operations of computing a tractable approximation and computing a lower bound using the newly selected modeling parameter, if the lower bound is not satisfactorily optimized in the first iteration.

16. The computer program product of claim 14 wherein computing a lower bound comprises: computing the lower bound of the log marginal likelihood as a function of prior distributions of the modeling parameters.

17. The computer program product of claim 14 wherein computing a tractable approximation of a posterior distribution comprises: computing a variational approximation of the posterior distribution of the selected modeling parameter.

18. The computer program product of claim 14 wherein one of the plurality of modeling parameters represents a mean of each of the Student distribution components.

19. The computer program product of claim 14 wherein one of the plurality of modeling parameters represents a precision matrix of the Student distribution components.

20. The computer program product of claim 14 wherein one of the plurality of modeling parameters represents a labeling parameter of the Student distribution components.

21. The computer program product of claim 14 wherein one of the plurality of modeling parameters represents a scaling parameter of a precision matrix of the Student distribution components.

22. The computer program product of claim 14 wherein one of the plurality of modeling parameters represents a mixing coefficients parameter of the Student distribution components.

23. The computer program product of claim 14 wherein generating a probability density comprises: generating the probability density including the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters and an estimate of the degrees of freedom of each Student distribution component.

24. The computer program product of claim 14 wherein the computer process further comprises: storing the current estimates of the posterior distributions of the modeling parameters in a storage location.

25. The computer program product of claim 14 wherein the input set of data represents auditory speech data from an unknown number of speakers, and further comprising determining a correct number of speakers from the probability density modeling the input set of data.

26. The computer program product of claim 14 wherein the input set of data represents image segmentation data from images having regions of different characteristics.

27. A system comprising: a modeling parameter selector selecting a modeling parameter from a plurality of modeling parameters characterizing a mixture of Student distribution components; a tractable approximation module computing a tractable approximation of a posterior distribution for the selected modeling parameter based on an input set of data and a current estimate of a posterior distribution of at least one unselected modeling parameter in the plurality of modeling parameters; a lower bound optimizer module computing a lower bound of a log marginal likelihood as a function of current estimates of the posterior distributions of the modeling parameters, the current estimates of the posterior distributions of the modeling parameters including the computed tractable approximation of the posterior distribution of the selected modeling parameter; and a data model generator generating a probability density modeling the input set of data, the probability density including the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters, if the lower bound is satisfactorily optimized.

28. The system of claim 27 wherein the lower bound optimizer computes the lower bound of the log marginal likelihood as a function of prior distributions of the modeling parameters.

29. The system of claim 27 wherein the tractable approximation module computes a variational approximation of the posterior distribution of the selected modeling parameter.

30. The system of claim 27 wherein one of the plurality of modeling parameters represents a mean of each of the Student distribution components.

31. The system of claim 27 wherein one of the plurality of modeling parameters represents a precision matrix of the Student distribution components.

32. The system of claim 27 wherein one of the plurality of modeling parameters represents a labeling parameter of the Student distribution components.

33. The system of claim 27 wherein one of the plurality of modeling parameters represents a scaling parameter of a precision matrix of the Student distribution components.

34. The system of claim 27 wherein one of the plurality of modeling parameters represents a mixing coefficients parameter of the Student distribution components.

35. The system of claim 27 wherein the data model generator generates the probability density including the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters and an estimate of the degrees of freedom of each Student distribution component.

36. The system of claim 27 further comprising: a memory storing the current estimates of the posterior distributions of the modeling parameters.

37. The system of claim 27 wherein the input set of data represents auditory speech data from an unknown number of speakers, and further comprising determining a correct number of speakers from the probability density modeling the input set of data.

38. The system of claim 27 wherein the input set of data represents image segmentation data from images having regions of different characteristics.

39. A method comprising: computing a tractable approximation of a posterior distribution for a selected modeling parameter of a plurality of modeling parameters characterizing a mixture of Student distribution components based on an input set of data and a current estimate of a posterior distribution of at least one unselected modeling parameter in the plurality of modeling parameters; determining whether current estimates of the posterior distributions of the modeling parameters are satisfactorily optimized, the current estimates of the posterior distributions of the modeling parameters including the computed tractable approximation of the posterior distribution of the selected modeling parameter; and modeling the input set of data by the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters.

40. The method of claim 39 wherein the computing operation and determining operation comprise a first iteration and further comprising: selecting a different modeling parameter from the plurality of modeling parameters and repeating in a subsequent iteration the operations of computing a tractable approximation and computing a lower bound using the newly selected modeling parameter, if the lower bound is not satisfactorily optimized in the first iteration.

41. The method of claim 39 wherein the operation of determining whether current estimates of the posterior distributions of the modeling parameters are satisfactorily optimized comprises: computing a lower bound of the log marginal likelihood as a function of prior distributions of the modeling parameters and a variational posterior distribution; and determining whether the lower bound satisfies a predetermined criterion of the selected modeling parameter.

42. The method of claim 39 wherein computing a tractable approximation of a posterior distribution comprises: computing a variational approximation of the posterior distribution.

43. The method of claim 39 wherein one of the plurality of modeling parameters represents a mean of each of the Student distribution components.

44. The method of claim 39 wherein one of the plurality of modeling parameters represents a precision matrix of the Student distribution components.

45. The method of claim 39 wherein one of the plurality of modeling parameters represents a labeling parameter of the Student distribution components.

46. The method of claim 39 wherein one of the plurality of modeling parameters represents a scaling parameter of a precision matrix of the Student distribution components.

47. The method of claim 39 wherein one of the plurality of modeling parameters represents a mixing coefficients parameter of the Student distribution components.

48. The method of claim 39 wherein modeling the input data comprises: generating the probability density including the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters and an estimate of the degrees of freedom of each Student distribution component.

49. The method of claim 39 further comprising: storing the current estimates of the posterior distributions of the modeling parameters in a storage location.

50. A computer program product encoding a computer program for executing on a computer system a computer process, the computer process comprising: computing a tractable approximation of a posterior distribution for a selected modeling parameter of a plurality of modeling parameters characterizing a mixture of Student distribution components based on an input set of data and a current estimate of a posterior distribution of at least one unselected modeling parameter in the plurality of modeling parameters; determining whether current estimates of the posterior distributions of the modeling parameters are satisfactorily optimized, the current estimates of the posterior distributions of the modeling parameters including the computed tractable approximation of the posterior distribution of the selected modeling parameter; and modeling the input set of data by the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters.

51. The computer program product of claim 50 wherein the computing operation and determining operation comprise a first iteration and further comprising: selecting a different modeling parameter from the plurality of modeling parameters and repeating in a subsequent iteration the operations of computing a tractable approximation and computing a lower bound using the newly selected modeling parameter, if the lower bound is not satisfactorily optimized in the first iteration.

52. The computer program product of claim 50 wherein the operation of determining whether current estimates of the posterior distributions of the modeling parameters are satisfactorily optimized comprises: computing a lower bound of the log marginal likelihood as a function of prior distributions of the modeling parameters and a variational posterior distribution; and determining whether the lower bound satisfies a predetermined criterion.

53. The computer program product of claim 50 wherein computing a tractable approximation of a posterior distribution comprises: computing a variational approximation of the posterior distribution of the selected modeling parameter.

54. The computer program product of claim 50 wherein one of the plurality of modeling parameters represents a mean of each of the Student distribution components.

55. The computer program product of claim 50 wherein one of the plurality of modeling parameters represents a precision matrix of the Student distribution components.

56. The computer program product of claim 50 wherein one of the plurality of modeling parameters represents a labeling parameter of the Student distribution components.

57. The computer program product of claim 50 wherein one of the plurality of modeling parameters represents a scaling parameter of a precision matrix of the Student distribution components.

58. The computer program product of claim 50 wherein one of the plurality of modeling parameters represents a mixing coefficients parameter of the Student distribution components.

59. The computer program product of claim 50 wherein modeling the input data comprises: generating the probability density including the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters and an estimate of the degrees of freedom of each Student distribution component.

60. The computer program product of claim 50 wherein the computer process further comprises: storing the current estimates of the posterior distributions of the modeling parameters in a storage location.

61. A system comprising: a tractable approximation module computing a tractable approximation of a posterior distribution for a selected modeling parameter of a plurality of modeling parameters characterizing a mixture of Student distribution components based on an input set of data and a current estimate of a posterior distribution of at least one unselected modeling parameter in the plurality of modeling parameters; an optimizer module determining whether current estimates of the posterior distributions of the modeling parameters are satisfactorily optimized, the current estimates of the posterior distributions of the modeling parameters including the computed tractable approximation of the posterior distribution of the selected modeling parameter; and a data model generator modeling the input set of data by the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters.

62. The system of claim 61 wherein the optimizer module computes a lower bound of the log marginal likelihood as a function of prior distributions of the modeling parameters and a variational posterior distribution, and determines whether the lower bound satisfies a predetermined criterion.

63. The system of claim 61 wherein the tractable approximation module computes a variational approximation of the posterior distribution of the selected modeling parameter.

64. The system of claim 61 wherein one of the plurality of modeling parameters represents a mean of each of the Student distribution components.

65. The system of claim 61 wherein one of the plurality of modeling parameters represents a precision matrix of the Student distribution components.

66. The system of claim 61 wherein one of the plurality of modeling parameters represents a labeling parameter of the Student distribution components.

67. The system of claim 61 wherein one of the plurality of modeling parameters represents a scaling parameter of a precision matrix of the Student distribution components.

68. The system of claim 61 wherein one of the plurality of modeling parameters represents a mixing coefficients parameter of the Student distribution components.

69. The system of claim 61 wherein modeling the input data comprises: generating the probability density including the mixture of Student distribution components, the mixture of Student distribution components being characterized by the current estimates of the posterior distributions of the modeling parameters and an estimate of the degrees of freedom of each Student distribution component.

70. The system of claim 61 further comprising: a memory storing the current estimates of the posterior distributions of the modeling parameters.